…ver the FB-generated cols (not all cols/metadata); (2) edit the log headers to specify what file the log is for
Pull request overview
This PR adds structured logging and performance/diagnostic output to the FeatureBuilder pipeline, threads a logger through the feature calculators and embedding generation, and introduces an opt-in post-processing step to identify and optionally drop sparse/redundant feature columns. It also updates documentation to reflect the new diagnostics and the deprecation/removal of analyze_first_pct.
Changes:
- Add `setup_logger` utility and wire structured loggers through `FeatureBuilder`, calculators, and embedding generation (plus perf timings).
- Add post-featurization redundancy diagnostics (`generate_summary_stats`) with opt-in column dropping (`drop_redundant_columns` and related thresholds).
- Update docs/examples and rebuild HTML docs artifacts; extend `.gitignore` for logs/CSVs.
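As a rough illustration of the logger helper described above, the sketch below creates the log directory, attaches a single file handler, and disables propagation to the root logger. The function name `setup_logger_sketch`, its parameters, and the record format are assumptions made for this example; the actual `setup_logger` signature in `preprocess.py` may differ.

```python
import logging
import os
import tempfile


def setup_logger_sketch(name, log_dir, filename, level=logging.INFO):
    """Hypothetical stand-in for setup_logger; the real signature may differ."""
    os.makedirs(log_dir, exist_ok=True)  # auto-create the log directory
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # keep records out of the root logger
    if not logger.handlers:  # guard against duplicate handlers on repeat calls
        handler = logging.FileHandler(os.path.join(log_dir, filename))
        # Assumed format: each record carries its own timestamp.
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


# Usage: calling the helper twice for the same name must not add a second handler.
log_dir = os.path.join(tempfile.mkdtemp(), "logs")
logger = setup_logger_sketch("fb_sketch", log_dir, "feature_builder.log")
logger = setup_logger_sketch("fb_sketch", log_dir, "feature_builder.log")
logger.info("run started")
```

Checking `logger.handlers` before attaching is one simple way to avoid duplicated log lines when a builder object is constructed more than once in the same process.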
Reviewed changes
Copilot reviewed 66 out of 69 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/team_comm_tools/utils/preprocess.py | Adds reusable setup_logger helper to create log dirs and avoid duplicate handlers. |
| src/team_comm_tools/utils/check_embeddings.py | Adds perf timing + logger usage; narrows torch imports; updates function signature. |
| src/team_comm_tools/utils/calculate_user_level_features.py | Adds perf timings for major user-level feature steps and logs durations. |
| src/team_comm_tools/utils/calculate_conversation_level_features.py | Adds per-feature-method perf timings and logs durations. |
| src/team_comm_tools/utils/calculate_chat_level_features.py | Adds per-method perf timings and logs durations; extends constructor to accept a logger. |
| src/team_comm_tools/feature_builder.py | Sets up loggers, adds run header + timings, removes first-% loop, adds redundancy diagnostics and optional column dropping. |
| docs/source/examples.rst | Documents drop_redundant_columns and marks first-% analysis as deprecated/removed. |
| docs/source/basics.rst | Updates customizable-parameters list and adds redundancy-reduction section. |
| docs/build/html/utils/zscore_chats_and_conversation.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/utils/summarize_features.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/utils/preprocess.html | Rebuilt HTML docs artifact; includes setup_logger in docs nav. |
| docs/build/html/utils/preload_word_lists.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/utils/index.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/utils/gini_coefficient.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/utils/check_embeddings.html | Rebuilt HTML docs artifact reflecting signature/logger updates. |
| docs/build/html/utils/calculate_user_level_features.html | Rebuilt HTML docs artifact reflecting logger parameter. |
| docs/build/html/utils/calculate_conversation_level_features.html | Rebuilt HTML docs artifact reflecting logger parameter. |
| docs/build/html/utils/calculate_chat_level_features.html | Rebuilt HTML docs artifact reflecting logger parameter. |
| docs/build/html/utils/assign_chunk_nums.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/search.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/py-modindex.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/intro.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/index.html | Rebuilt HTML docs artifact (formatting/pygments whitespace changes). |
| docs/build/html/genindex.html | Rebuilt HTML docs artifact; index now includes new FeatureBuilder methods. |
| docs/build/html/features/word_mimicry.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/within_person_discursive_range.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/variance_in_DD.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/turn_taking_features.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/textblob_sentiment_analysis.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/temporal_features.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/reddit_tags.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/readability.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/question_num.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/politeness_v2.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/politeness_v2_helper.html | Rebuilt HTML docs artifact; docstring formatting/structure updates. |
| docs/build/html/features/politeness_features.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/other_lexical_features.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/named_entity_recognition_features.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/lexical_features_v2.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/information_diversity.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/info_exchange_zscore.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/index.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/hedge.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/get_user_network.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/get_all_DD_features.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/fflow.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/discursive_diversity.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/certainty.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/burstiness.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features/basic_features.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features_conceptual/word_ttr.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features_conceptual/turn_taking_index.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features_conceptual/TEMPLATE.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features_conceptual/positivity_bert.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features_conceptual/named_entity_recognition.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features_conceptual/moving_mimicry.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features_conceptual/mimicry_bert.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features_conceptual/index.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features_conceptual/function_word_accommodation.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/features_conceptual/content_word_accommodation.html | Rebuilt HTML docs artifact (static asset hash changes). |
| docs/build/html/feature_builder.html | Rebuilt HTML docs artifact reflecting new params/methods and removal of first-% method. |
| docs/build/html/examples.html | Rebuilt HTML docs artifact with redundancy section and first-% deprecation. |
| docs/build/html/.buildinfo | Rebuilt Sphinx build metadata. |
| docs/build/html/_static/searchtools.js | Updates Sphinx search JS (includes a selector change needing correction). |
| docs/build/html/_static/pygments.css | Rebuilt pygments CSS (color normalization/formatting changes). |
| docs/build/html/_sources/examples.rst.txt | Rebuilt RST source artifact reflecting docs/source/examples.rst changes. |
| .gitignore | Ignores .claude/, and adds *.csv/*.log patterns. |
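The per-method timings listed for the calculator modules in the table above could follow a pattern like the one below. This is a minimal sketch, not the PR's code: `timed_call`, the logger name, and the message format are hypothetical.

```python
import logging
from time import perf_counter

logger = logging.getLogger("timing_sketch")
logger.addHandler(logging.NullHandler())  # placeholder; a file handler would go here


def timed_call(logger, label, func, *args, **kwargs):
    """Run func, log how long it took, and return (result, seconds elapsed)."""
    start = perf_counter()
    result = func(*args, **kwargs)
    elapsed = perf_counter() - start
    logger.info("%s took %.3f s", label, elapsed)
    return result, elapsed


# Usage with a toy "feature method" that counts words in a message.
result, elapsed = timed_call(
    logger, "count_words", lambda text: len(text.split()), "hello there team"
)
```

`perf_counter` is a monotonic, high-resolution clock, which makes it better suited to measuring durations than wall-clock time.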
The test_drop_redundant_columns.py tests (committed in 93b074e) read this CSV, but it was silently skipped by the *.csv .gitignore rule, leaving the test without its fixture on the remote. Force-add it so the test is runnable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The blanket *.csv ignore rule silently dropped new test fixtures (e.g. test_redundant_columns.csv), requiring git add -f. Negate the rule for the fixture directory so fixtures stage normally; CSVs elsewhere stay ignored. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Made a few updates:
xehu
approved these changes
Jun 1, 2026
The test workflow invokes pytest file-by-file, so test_drop_redundant_columns.py was never collected. Add it to the Run tests step. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
- Adds a `setup_logger` utility in `src/team_comm_tools/utils/preprocess.py` that writes timestamped logs to `./<output_file_base>/logs/`, auto-creates the directory, and guards against duplicate handlers and propagation to the root logger.
- Sets up loggers in `FeatureBuilder` (`feature_builder.log` for top-level run info and `summary_details.log` for verbose per-column output) and threads the logger through `ChatLevelFeaturesCalculator`, `UserLevelFeaturesCalculator`, `ConversationLevelFeaturesCalculator`, and `check_embeddings`.
- `print` and `warnings.warn` calls for errors and invalid configs are now mirrored to the log.
- Adds `perf_counter`-based timings around each feature method (chat / user / conversation level) and around sentence-vector and BERT generation in `check_embeddings.py`, so the log captures per-step durations.
- Adds post-featurization redundancy diagnostics (`generate_summary_stats`) that, for each output level, report columns with high NA ratios, columns with high zero ratios, and groups of highly correlated columns (Spearman, configurable threshold). New constructor params: `corr_thresh`, `min_na_ratio`, `min_zero_ratio`, `min_group_size`, `treat_zero_as_na`, `drop_redundant_columns`. With `drop_redundant_columns=True`, columns whose NA or zero ratios exceed the thresholds are dropped, and only one representative per correlated group is kept (chosen by valid-data count and variance); the other columns in the group are dropped.
- … `featurize`.
- Narrows imports in `check_embeddings.py` (drops top-level `torch` / unused `util` in favor of the narrower `from torch import cuda, no_grad`).
- … `feature_builder.py`.
- Adds `*.csv` and `*.log` to `.gitignore`.

Behavior change to flag for review
As discussed, the first x percent feature is deprecated.
`analyze_first_pct` and `get_first_pct_of_chat` are commented out in `feature_builder.py`, removing the multi-truncation loop in `featurize`.
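For reviewers who want a concrete picture of the redundancy diagnostics described in the Summary, the sketch below flags high-NA columns, high-zero columns, and Spearman-correlated groups on a toy DataFrame. It is an illustration under stated assumptions, not the PR's `generate_summary_stats`: the function names, the default thresholds, the greedy grouping, and the tie-break direction (more valid values first, then higher variance) are all hypothetical.

```python
import numpy as np
import pandas as pd


def summarize_redundancy(df, corr_thresh=0.9, min_na_ratio=0.5, min_zero_ratio=0.5):
    """Illustrative only; thresholds and return shape are assumptions."""
    numeric = df.select_dtypes(include="number")
    na_ratio = numeric.isna().mean()        # share of missing values per column
    zero_ratio = (numeric == 0).mean()      # share of exact zeros per column
    high_na = na_ratio[na_ratio >= min_na_ratio].index.tolist()
    high_zero = zero_ratio[zero_ratio >= min_zero_ratio].index.tolist()

    corr = numeric.corr(method="spearman").abs()
    groups, seen = [], set()
    for col in corr.columns:                # greedy grouping (an assumed strategy)
        if col in seen:
            continue
        partners = [
            other for other in corr.columns
            if other != col and other not in seen
            and corr.loc[col, other] >= corr_thresh
        ]
        if partners:
            group = [col] + partners
            seen.update(group)
            groups.append(group)
    return high_na, high_zero, groups


def pick_representative(df, group):
    """Assumed tie-break: most non-missing values, then highest variance."""
    return max(group, key=lambda c: (df[c].notna().sum(), df[c].var()))


toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12],                  # monotone in "a"
    "c": [0, 0, 0, 0, 1, 0],                    # mostly zeros
    "d": [np.nan, np.nan, np.nan, 2.0, 1.0, 3.0],  # half missing
})
high_na, high_zero, groups = summarize_redundancy(toy)
keep = pick_representative(toy, groups[0])
```

On this toy data, `a` and `b` form the only correlated group, `c` is flagged for zeros, and `d` is flagged for missing values; a drop step would then remove the flagged columns and every group member other than the chosen representative.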